Improving Text Categorization by Using a Topic Model

Author

  • Wongkot Sriurai
Abstract

Most text categorization algorithms represent a document collection as a Bag of Words (BOW). The BOW representation cannot recognize synonyms within a given term set, nor can it recognize semantic relationships between terms. In this paper, we apply the topic-model approach to cluster words into a set of topics; words assigned to the same topic are semantically related. Our main goal is to compare the feature processing techniques of BOW and the topic model. We also apply and compare two feature selection techniques: Information Gain (IG) and Chi Squared (CHI). Three text categorization algorithms are used for evaluation: Naïve Bayes (NB), Support Vector Machines (SVM) and Decision Tree. The experimental results showed that the topic-model representation of documents yielded the best performance, an F1 measure of 79%, under the SVM algorithm with the IG feature selection technique.
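
The two feature selection scores named in the abstract can be illustrated with a minimal sketch. It assumes the common binary formulation, in which each term is scored against one category from a 2x2 table of document counts; the abstract does not state which variant the paper uses, and the function names and the counts a, b, c, d are hypothetical, chosen only for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


# Assumed 2x2 contingency table for one term and one category:
#   a = documents in the category that contain the term
#   b = documents outside the category that contain the term
#   c = documents in the category that lack the term
#   d = documents outside the category that lack the term

def information_gain(a, b, c, d):
    """Reduction in category entropy from knowing whether the term occurs."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    prior = entropy([a + c, b + d])    # entropy of the category alone
    with_term = entropy([a, b])        # entropy among documents with the term
    without_term = entropy([c, d])     # entropy among documents without it
    return prior - ((a + b) / n) * with_term - ((c + d) / n) * without_term


def chi_squared(a, b, c, d):
    """Chi-squared statistic of independence between term and category."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

Under both scores, a term that occurs in every document of a category and nowhere else ranks highest, while a term spread evenly across categories scores zero. Feature selection then keeps only the top-ranked features before training the classifier.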


Similar resources

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...


Text categorization using topic model and ontology networks

Text categorization based on pre-defined document categories is one of the most crucial tasks in text mining applications in recent decades. Successful text categorization highly relies on the text representations generated from documents. In this paper, an innovative text categorization model, VSM_WN_TM, is presented. VSM_WN_TM is a special Vector Space Model (VSM) that incorporates word frequ...


Summary of Text Categorization based on Maximum Entropy Model

Since 1990s, the maximum entropy model has been used in text categorization and achieves good results in Natural Language Processing since its framework and algorithm were established. On the basis of the Maximum Entropy Model, scholars improve it and make a more in-depth study. Using Maximum Entropy Model for text sentiment categorization has become a hot research topic in recent years. In thi...


Improving Domain Dictionary-based Text Categorization Using Self-partition Model

In this paper, we present a novel model for improving the performance of Domain Dictionary-based text categorization. The proposed model is named the Self-Partition Model (SPM). SPM can group the candidate words into the predefined clusters, which are generated according to the structure of the Domain Dictionary. Using these learned clusters as features, we proposed a novel text representation. The e...


A Survey Paper On Naive Bayes Classifier For Multi-Feature Based Text Mining

Text mining is a variant of the field called data mining. Text mining, also referred to as “Text Analytics”, is used to make unstructured data workable by the computer. Text categorization, also called topic spotting, is the task of automatically classifying a set of documents into groups from a predefined set. Text classification is an essential application and research topic because of incr...




Publication date: 2011